The network characteristics based on the phonological similarities in the lexicons of several
languages were examined. These languages differed widely in their history and linguistic structure, but
commonalities in the network characteristics were observed. These networks were also found to be 
different from other networks studied in the literature. The properties of these networks suggest ex- 
planations for various aspects of linguistic processing and hint at deeper organization within human 
language. 
I. INTRODUCTION

The results of numerous graph-theoretic analyses sug- 
gest that a number of principles may influence the emer- 
gent structures found in a wide variety of complex sys- 
tems, including information, social, technological, and 
biological networks [1] [2] [3]. These unifying
characteristics include small-world properties, distinct community
structure, and scale-free distributions of the network con- 
nectivity. 

Many aspects of language can be examined from a net- 
work perspective as well. Numerous studies have been 
conducted on semantic networks, where relationships in 
meaning have been made between words. These are often
based on thesauri, word associations in corpora, or
from academic databases [H [S]. In addition, linguistic 
networks have been made from orthographic similari- 
ties of words (how words are spelled) [B]- Lastly, lan- 
guage can be viewed from the sounds of words (their 
phonological structure), where words that sound simi- 
lar are neighbors. Although previous experiments have 
examined small portions of phonological networks (near- 
est neighbors of words) in the context of psycholinguistic 
theories of spoken word recognition [7], the first graph- 
theoretic analysis of an entire language network only ap- 
peared more recently [S]. 

In these phonological networks, words in a language are 
represented as vertices or nodes, and an edge is placed 
between them if the words sound similar to each other 
(differing only by a single phoneme, or sound segment). 

FIG. 1: A phonological network for five English words.

For example, as shown in Figure 1, vertices representing
the words hand, send, sad, and, and stand would all
have edges connecting them to the vertex for the word 
sand. These phonological networks are especially intrigu- 
ing to examine because psycholinguistic studies suggest 
that several characteristics of the network influence cog- 
nitive processing, such as word recognition and retrieval 
EllSI- 
In examining English, Vitevitch [5] found that its
phonological network had a small giant component (the
largest connected portion of the graph), with many other
smaller components ("islands"). This property is distinct
from other complex networks observed in the literature.
In addition, the degree distribution (the distribution of 
the number of edges per node) was not well modeled by 
a scale-free distribution, or a power law. 

Here, we explore the generality of these results by
conducting the first comparative study of the phonological
networks of multiple languages. We examined some of the
properties that Vitevitch examined in English, as well as
a number of others, and found that the phonological
networks all share certain properties that distinguish
them from other types of complex networks (such as
biological and social networks).
II. METHODS

The network structure of selected languages was ex- 
amined to determine the generality of the network char- 
acteristics previously observed in English [5]. In addi- 
tion to English, the following languages were examined: 
Spanish, Mandarin, Hawaiian, and Basque (see Table [I]). 
Similar network characteristics across a variety of lan- 
guages might hint toward principles that are common to 
all languages, whereas differences in network measures 
might provide a quantitative way to describe and cate- 
gorize the languages of the world. 

English is an Indo-European language from the Ger- 
manic branch, whereas Spanish comes from the Romance 
branch of the Indo-European family of languages. Man- 
darin, a Sino-Tibetan language, differs from English, 
Spanish, Hawaiian and Basque in that it also uses tones 
to convey word meanings (e.g., "fan" with a high level 
tone means sail, with a rising tone means trouble, with a 
dipping tone means turn, and with a falling tone means 
rice). Tone was not included in the phonological
transcriptions, however. Hawaiian is an Austronesian
language with a phoneme inventory (the number of
consonants and vowels in the language) that is smaller than
those found in English, Spanish, Mandarin, and Basque. 
Finally, Basque (or Euskara) is a linguistic isolate, mean- 
ing that it is not (or has not yet been identified as) a 
member of a given language family. Additional differ- 
ences, such as those in morphology, exist among the lan- 
guages that were selected for the present network analy- 
ses. 

The phonological networks were constructed from a 
variety of sources. The English network contained the 
words from the 1964 Merriam-Webster Pocket Dictionary;
this database has been used extensively in
psycholinguistic studies [7]. The Hawaiian network was
created in a similar manner using a Hawaiian dictionary
[10]. The Spanish network consisted of the words in the
LEXESP database [11], a large Spanish language corpus.
The words in the Basque network were obtained in a
manner similar to the words in the Spanish network [12].
The Mandarin network uses the words from a database
compiled in [13].
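
The construction rule described in the Introduction (an edge between two words that differ by a single phoneme) can be sketched as follows. This is only an illustration, not the procedure actually used to build the networks: the function names, the adjacency-set representation, and the toy transcriptions are all assumptions made for the example, and "differing by a single phoneme" is read as one substitution, insertion, or deletion, which is what the sand example implies.

```python
from itertools import combinations


def differ_by_one_phoneme(a, b):
    """True if sequences a and b are one substitution, insertion, or deletion apart."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the shared prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substituted phoneme
    return a[i:] == b[i + 1:]          # exactly one extra phoneme in b


def build_network(lexicon):
    """Map each word to the set of words one phoneme away (undirected adjacency sets)."""
    graph = {word: set() for word in lexicon}
    for w1, w2 in combinations(lexicon, 2):
        if differ_by_one_phoneme(lexicon[w1], lexicon[w2]):
            graph[w1].add(w2)
            graph[w2].add(w1)
    return graph


# Toy transcriptions, invented for illustration (not taken from any dictionary).
lexicon = {
    "sand":  ("s", "ae", "n", "d"),
    "hand":  ("h", "ae", "n", "d"),
    "send":  ("s", "eh", "n", "d"),
    "sad":   ("s", "ae", "d"),
    "and":   ("ae", "n", "d"),
    "stand": ("s", "t", "ae", "n", "d"),
}
network = build_network(lexicon)
print(sorted(network["sand"]))
```

Comparing every pair of words is quadratic in the size of the lexicon, which is fine for a toy example but would likely need a smarter indexing scheme for lexicons with on the order of 100,000 words.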



III. RESULTS 
A. Unique Characteristics of the Giant Component 

1. Giant Component Size 

The giant components of the language networks were
much smaller than those of other networks discussed in
the literature. Typically, the giant component contains
approximately 80 to 90% of the vertices [14]. In the
present networks, however, the proportion of vertices in
the giant component was much smaller, with some
networks having less than 50% of their vertices in the
giant component. The giant components of comparably
sized random networks, which contain 70 to 80% of the
vertices, are also larger than those of the language
networks [15]. This difference in giant component size
indicates the prevalence of smaller components in the
networks, and suggests that these phonological networks
may be more robust to node removal because their
components are more tightly connected.
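
As a rough illustration of the quantity discussed here, the proportion of vertices in the giant component can be obtained by finding all connected components with a breadth-first search and taking the largest. The sketch below assumes the network is stored as a dictionary of adjacency sets; the function names and the toy graph are hypothetical.

```python
from collections import deque


def connected_components(graph):
    """Return the components of an undirected graph as a list of vertex sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def giant_component_fraction(graph):
    """Fraction of all vertices that lie in the largest connected component."""
    largest = max(connected_components(graph), key=len)
    return len(largest) / len(graph)


# Hypothetical toy graph: one 4-vertex component, one pair, two isolated vertices.
toy = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
    "g": set(), "h": set(),
}
print(giant_component_fraction(toy))  # 4 of 8 vertices
```

Isolated words count as components of size one here, so a lexicon with many "islands" yields a small fraction even when its largest component is internally dense.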



2. Robustness to Vertex Removal 

To evaluate the robustness of the networks, vertices
were removed in two ways: at random, and in decreasing
order of degree (the number of edges connected to a
vertex). The results are shown in Figure 2. In scale-free
networks, when vertices are removed at random the mean
shortest path length remains constant, whereas when
vertices are removed in order of degree, the mean shortest
path length increases dramatically [3]. In the language
networks, however, both methods of node removal
resulted in little to no change in the mean shortest path
length. The shortest path lengths were estimated using
a sampling technique: 1,000 nodes were chosen at
random, the distances from each of them to all other nodes
in the same component were obtained, and these path
lengths were then averaged. This sped up the calculations
considerably. The robustness observed under these common
methods of node removal is striking and merits further
examination.
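
The sampling estimate and the two removal schemes described above might be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the reported results: the network is assumed to be a dictionary of adjacency sets, all function names are hypothetical, and targeted removal here ranks vertices by their initial degree (the text does not say whether degrees were recomputed after each removal).

```python
import random
from collections import deque


def bfs_distances(graph, source):
    """Hop counts from source to every vertex in the same component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def sampled_mean_path_length(graph, n_samples=1000, rng=random):
    """Average distance from sampled sources to all reachable vertices."""
    nodes = list(graph)
    sources = rng.sample(nodes, min(n_samples, len(nodes)))
    total, count = 0, 0
    for source in sources:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                count += 1
    return total / count if count else float("nan")


def remove_vertices(graph, victims):
    """Copy of the graph without the given vertices or their edges."""
    victims = set(victims)
    return {v: nbrs - victims for v, nbrs in graph.items() if v not in victims}


def targeted_victims(graph, k):
    """The k vertices of highest (initial) degree."""
    return sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:k]


# Toy path graph 0-1-2-3-4 (hypothetical data, small enough to check by hand).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
rng = random.Random(0)
baseline = sampled_mean_path_length(path, rng=rng)
after_random = sampled_mean_path_length(
    remove_vertices(path, rng.sample(list(path), 1)), rng=rng)
after_targeted = sampled_mean_path_length(
    remove_vertices(path, targeted_victims(path, 1)), rng=rng)
print(baseline, after_random, after_targeted)
```

Because distances are only averaged over pairs in the same component, removing a vertex that splits a component can lower the estimate as well as raise it, which is worth keeping in mind when reading curves such as those in Figure 2.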

3. Assortative Mixing 

In addition, we examined the assortative mixing by
degree of the language networks, which is a measure of
the correlation of degree between neighboring nodes. As
seen in Table I, all of the language networks had large,
positive correlations between the degrees of connected
vertices, indicating that high-degree vertices tended to be
connected to each other. Newman [16] discussed how
networks with assortative mixing by degree are more robust
to vertex removal and percolate more easily (i.e., diseases
or information spread easily) than networks with
disassortative mixing. The high assortative mixing observed
in the phonological networks is distinct from other types
of networks: biological and technological networks are
often disassortatively mixed, and social networks, which
do display assortative mixing, still have lower values.
Typical assortativity values are 0.1 to 0.3 for social
networks and -0.1 to -0.2 for biological and technological
networks [16]. The values for the phonological networks,
by contrast, are all above 0.5 and can exceed 0.7.
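
To make the measure concrete, assortative mixing by degree can be computed as the Pearson correlation between the degrees found at the two ends of each edge. The sketch below is one straightforward way to do this, assuming a dictionary of adjacency sets; the function name and toy graphs are hypothetical, and the code is not taken from the study.

```python
def degree_assortativity(graph):
    """Pearson correlation of the degrees at either end of an edge.

    Each undirected edge is counted in both directions, so the two
    marginal degree distributions are identical. The value is undefined
    (division by zero) when every vertex has the same degree.
    """
    pairs = [(len(graph[u]), len(graph[v])) for u in graph for v in graph[u]]
    n = len(pairs)
    mean = sum(j for j, _ in pairs) / n
    variance = sum(j * j for j, _ in pairs) / n - mean ** 2
    covariance = sum(j * k for j, k in pairs) / n - mean ** 2
    return covariance / variance


# Hypothetical toy graphs.
# A star: the high-degree hub connects only to degree-1 leaves.
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
# A triangle plus a separate pair: every edge joins vertices of equal degree.
islands = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d"},
}
print(degree_assortativity(star), degree_assortativity(islands))
```

The second toy graph hints at why a network with many separate, internally uniform islands can score highly on this measure: edges inside each island join vertices of similar degree.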

High assortative mixing not only suggests robustness in
the phonological networks and highlights the resilience
of lexical processing in the face of injury to the
language-related areas of the brain (e.g., stroke), but it also
has implications for the searchability of the phonological
networks under intact conditions [17]. This feature of the
phonological network may contribute to the high rates of
accuracy with which words are retrieved from the mental
lexicon; one study estimated that healthy adult speakers
make an error only 0.1 to 0.2% of the time they speak
[18]. Lexical processing might proceed more slowly and
errors in word retrieval might be more common if the
phonological networks did not have such a robust
structure. The phonological networks of patients with aphasia
or other neurogenic disorders that disrupt language
processing could be used to test this hypothesis.

TABLE I: Summary information of phonological networks in several languages. GC stands for Giant Component and RN
stands for Random Network.

| Measure                                       | English      | Spanish       | Mandarin      | Hawaiian     | Basque        |
|-----------------------------------------------|--------------|---------------|---------------|--------------|---------------|
| Network Size (number of words)                | 19,323       | 122,066       | 30,086        | 2,578        | 99,321        |
| Giant Component Size (proportion of network)  | 6,498 (0.34) | 44,833 (0.37) | 19,712 (0.66) | 1,406 (0.55) | 35,173 (0.35) |
| Assortative Mixing by Degree (r)              | 0.657        | 0.762         | 0.654         | 0.556        | 0.719         |
| Average Shortest Path Length                  | 2.7          | 4.3           | 6.5           | 3.2          | 4.4           |
| Average Shortest Path Length (GC)             | 6.1          | 10.3          | 10.1          | 5.5          | 10.4          |
| Average Shortest Path Length of RN (using GC) | 5.8          | 9.9           | 7.3           | 5.8          | 11.4          |
| Clustering Coefficient                        | 0.284        | 0.191         | 0.383         | 0.241        | 0.206         |
| Clustering Coefficient of RN                  | 8.35e-5      | 1.17e-5       | 8.55e-5       | 7.40e-4      | 1.21e-5       |
| Transitivity                                  | 0.313        | 0.250         | 0.404         | 0.260        | 0.232         |
| Ratio of Edges to Vertices                    | 1.61         | 1.43          | 2.57          | 1.91         | 1.21          |
| Ratio of Edges to Vertices (GC)               | 4.55         | 2.95          | 3.88          | 3.44         | 2.50          |

FIG. 2: An example run of node removal in English, either
random or in a targeted fashion (in order by degree). Up
to 5% of the nodes were removed, and all languages showed
similar patterns to the above results. In addition, when the
simulations were done only for the giant component, a similar
constant, though elevated, value of the average shortest path
length was found.



B. Small-world properties 

Although the languages differ in their history and
linguistic characteristics, they all share a number of
similarities in their network structure. An important
commonality across the languages is that they all have the
properties of a small-world network [19], that is, a high
clustering coefficient and a short vertex-to-vertex distance.
The clustering coefficient can be calculated for each node
(the average value of which is reported in Table I),
and is the fraction of pairs of neighbors of a given node that
are neighbors of each other. It is also known as network
density. The vertex-to-vertex distance, also known as the
shortest path length, is the smallest number of hops
needed to go from one node to another in a network. Since
these networks have many components, the shortest path
length from one node to another is only calculated for nodes
that are in the same component [3]. In addition, the
mean shortest path length was calculated just within the
giant component of each language.

As seen in Table I, the values for the clustering
coefficient are several orders of magnitude larger than what
would be expected from a comparably sized random
network (a network with the same number of nodes and
edges), which can be calculated analytically [19]. The
values of the clustering coefficient are also comparable to
a similar measure referred to as transitivity, which is a
more global measure of clustering [S]- 
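
The difference between the average clustering coefficient and transitivity can be illustrated with a small sketch. The definitions used below are the standard ones (local clustering averaged over vertices, versus the fraction of connected triples that are closed), but the function names, the adjacency-set representation, and the choice to assign a clustering coefficient of zero to vertices with fewer than two neighbors are assumptions of this example, not details confirmed by the text.

```python
from itertools import combinations


def local_clustering(graph, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = graph[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than two neighbors
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)


def average_clustering(graph):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(graph, v) for v in graph) / len(graph)


def transitivity(graph):
    """Closed connected triples divided by all connected triples."""
    closed = 0   # triples whose two end vertices are also linked
    triples = 0  # all paths of length two, counted at their center vertex
    for v in graph:
        k = len(graph[v])
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(graph[v], 2) if b in graph[a])
    return closed / triples if triples else 0.0


# Hypothetical toy graph: a triangle (a, b, c) with one pendant vertex d on a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(average_clustering(toy), transitivity(toy))
```

Averaging local values weights every vertex equally, whereas transitivity weights vertices by the number of neighbor pairs they have, so the two measures can differ noticeably when degrees vary widely.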

On the other hand, the mean shortest path length of
each language network's giant component, calculated
using a random sample of 1,000 nodes, was similar to the
mean shortest path length for comparably sized random
networks, and significantly shorter than the overall num- 
ber of nodes in the network, as seen in Table |l] P3] . The 
statistics of the giant component were used for compara- 
ble random networks, because the overall ratio of edges 
to nodes is far lower than within the giant component, 
due to the large number of islands in the networks. 

Since a small-world structure is often a prerequisite for
rapid search, and it is well known that lexical retrieval
processes are rapid and robust, it is plausible that
the networks are optimally structured for search.
A clear direction for future research is to examine
these networks for the properties, such as those discussed
by Kleinberg [20], that allow for rapid and robust search.

However, it must be noted that, unlike in social
networks, where it is clear what a distance of three friends
means, for example, it is not entirely clear what the
qualitative difference is between a distance of 5 and 6 within
phonological networks. This is important when looking at the
average shortest path lengths of the giant components of
the different language networks. For instance, is it
relevant that this value for Mandarin (10.1) is nearly twice
that of Hawaiian (5.5)? While it is likely that this number is
most relevant relative to the size of the entire network
(the values are all orders of magnitude smaller than the size
of the lexica examined), these differences might hint at
more significant distinctions between the languages
examined.

TABLE II: Languages and best fit parameters for a truncated
power law. All fits had p-values of less than $10^{-10}$.

| Language | Exponent ($\alpha$) | Cutoff ($z_c$) |
|----------|---------------------|----------------|
| English  | 0.826               | 16.14          |
| Spanish  | 0.815               | 7.06           |
| Mandarin | -1.0                | 3.69           |
| Hawaiian | 0.270               | 7.34           |
| Basque   | 0.575               | 4.56           |


The common occurrence of the small-world property
in observed networks may suggest that it is less a
relevant property of language than simply an indicator that
language is a fairly organic, unplanned construct. It is
interesting, however, that the path length within a network
appears to be an important property for language
processing. A recent study [21] demonstrated that a measure
related to path length in a phonological network (i.e., the
minimum number of substitution, insertion, or deletion
operations required to turn one word into another)
influenced pronunciation times in visual word recognition
tasks. Therefore, the relevance of the differences in average
path length across languages warrants further investigation.

C. Degree Distribution 

The degree distributions of scale-free networks obey a
power law function, $P(z) \sim z^{-\alpha}$. In contrast to many
observed networks, we find that the language networks
deviate from this behavior. Instead, they are reasonably
well fit by truncated power laws, similar to scientific
co-authorship networks [2], as seen in Table II. A truncated
power law, or a power law with an exponential cutoff, is
defined as follows:

$$P(z) \sim z^{-\alpha} e^{-z/z_c} \qquad (1)$$

Table II shows the parameters of the best fit of a
truncated power law for the degree distribution of each
language, as calculated by the methods found in Clauset et
al. [25]. All fits had p-values of less than $10^{-10}$ in the
comparison between a truncated power law and a
traditional power law, favoring the truncated form. In
addition, as can be seen, Mandarin's fit is essentially an
exponential distribution, with no power-law portion.
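
To give a feel for the functional form in Eq. (1), the sketch below evaluates a truncated power law numerically using the English parameters from Table II. It only illustrates the shape of the distribution and is not the fitting procedure of Clauset et al.; the function name, the normalization over degrees 1 to `z_max`, and the value of `z_max` are assumptions made for the example.

```python
import math


def truncated_power_law(alpha, z_c, z_max=200):
    """Probabilities proportional to z**(-alpha) * exp(-z / z_c) for z = 1..z_max."""
    weights = [z ** (-alpha) * math.exp(-z / z_c) for z in range(1, z_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Best-fit parameters reported for English in Table II.
alpha, z_c = 0.826, 16.14
english = truncated_power_law(alpha, z_c)

# Doubling the degree from 20 to 40 under a pure power law would scale the
# probability by 2**(-alpha); the exponential cutoff shrinks it further.
pure_ratio = 2 ** (-alpha)
truncated_ratio = english[39] / english[19]  # list index is z - 1
print(pure_ratio, truncated_ratio)
```

The gap between the two ratios grows with degree, which is the sense in which the distribution decays faster than a traditional power law once the degree is comparable to the cutoff.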

Amaral et al. [23] found that if there is a constraint
associated with the attachment of a new vertex (i.e., the
vertex may only be able to accommodate a fixed number
of edges), then a power law degree distribution, like
that in the scale-free model proposed by Barabási and
Albert [53], is not likely to be observed. In the language
networks, a variety of constraints on word formation are
present, such as the number of phonemes in the inventory
of the language, the sequential arrangement of phonemes
in words, the length of words, and the extent to which
the language relies on morphemes (the smallest
meaningful units). All of these constraints limit the number of
words that might be phonologically similar. Therefore, a
truncated power law, or a similar distribution that decays
faster than a traditional power law, is a reasonable fit
for the degree distributions of phonological networks.

FIG. 3: The degree distributions of two of the language
networks (English and Spanish), on a log-log scale. The final
point for each distribution was not plotted, for legibility.

IV. CONCLUSION 

The phonological networks of a variety of languages
show a unique structure not found in other complex
networks described in the literature. Despite coming from
a diverse range of language families, the networks all
exhibited a common set of properties. Notably, the degree
distribution is found to lie somewhere between a power
law and an exponential distribution.

Furthermore, a small-world structure was observed, in
conjunction with giant components far smaller than those
typically observed. The small size of the giant component,
together with the strong assortative mixing by degree and
the robustness of the network to the removal of vertices, is
suggestive of the resilience of language processing in the
brain, although further study is necessary.

Together, these observed characteristics hint at some 
deeper organization within language. Despite surface dif- 
ferences among languages, there are important common- 
alities that have implications for the processing of lan- 
guage in humans. The intriguing characteristics of these 
networks merit further investigation from network scien- 
tists as well as psycholinguistic researchers. 